Abstract

This paper describes a method for local optimization of VLSI leaf cells, using
the parameterized procedural layout language ALLENDE [5]. Tradeoffs among
delay time, power consumption, and area are illustrated. Three different
implementations of the 1-bit full adder are compared: a random logic circuit,
a data selector, and a PLA. The fastest random logic 1-bit full adder has a
time-power product about 1/3 that of the fastest data selector, and about 1/4
that of the fastest PLA. The 4-bit parallel adder is used to illustrate the
effect of loading when leaf cells are combined.

1. Introduction

In the design of a custom VLSI chip it often happens that there is one cell
that is used many times, usually in an array or a recursive structure. The fact
that a cell is used many times means that there is a large potential payoff in
its optimization, and that the problem can be made small enough to be
manageable. Arrays of cells are especially common in digital signal processing
applications, where regular structures, like systolic arrays, lead to designs
that are easy to lay out efficiently and have high throughput. As examples,
bit-parallel and bit-serial multipliers can be constructed from one- and
two-dimensional arrays of one-bit full adders, as can a wide variety of
pipelined FIR and IIR filters (see [1], for example). As another example, a
processor for updating one-dimensional cellular automata has been designed at
Princeton which consists of a one-dimensional array of 5-input/1-output PLA's
[10]. In such cases the problem of making most efficient use of a given piece
of silicon breaks down into two distinct problems: 1) choice of the global
packing strategy (the method of laying out and interconnecting leaf cells, and
connecting them to power and clocks), and 2) the design of the iterated
structure itself (which we call the leaf cell). In this paper we study the
second problem: the design of efficient leaf cells. The example used throughout
is the most common in digital signal processing, the 1-bit full adder.

There are three important measures of how good a leaf cell is: its time
delay T; its peak or average power dissipation Pmax or Pav; and its area A.

This work was supported by National Science Foundation Grant ECS-6307955, U.S.
Army Research Office, Durham, NC, under Grant DAAG29-82-K-0096, DARPA Contract
N00014-82-K-0549, and ONR Grant N00014-83-K-0275.
Ideally, the designer should be able to trade off these measures, one against
the other. For example, in one application the clock may be fixed at a known
value T0, and it would therefore be senseless to make the cell faster. On the
other hand, peak power may be a real constraint because of heat dissipation
limitations, and at the same time it may be important to keep the area small so
as to fit as many cells on one chip as possible. We might therefore try to
minimize some measure of the peak power and area (the product, for example),
while enforcing the constraint T ≤ T0. In other applications speed may be
critical, and it may be important to minimize T while observing constraints on
P and A, and so on. In general, we would like to have enough information about
the tradeoffs among the measures T, P and A to make intelligent design
decisions. As we will see, the P-T tradeoff is often of most interest, since
the area is often a less sensitive function of design parameters (at least for
fixed topology).
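As an illustration of the constrained tradeoff described above, the following
minimal sketch picks, among a few candidate designs, the one that minimizes the
product Pmax * A subject to T <= T0. The tuples are Random Logic rows copied
from Table 1; the function name and the choice of the product as the cost are
assumptions made only for this example.

```python
# Each tuple is (A, Pmax, T) for one Random Logic row of Table 1,
# in the units used by the paper's tables.
designs = [
    (7742, 1957, 16.5),
    (9600, 2427, 16.4),
    (5194, 1096, 22.6),
    (4704, 1018, 25.9),
    (5136, 1174, 22.9),
]


def best_under_delay(candidates, t0):
    """Return the candidate minimizing Pmax * A among those with T <= t0.

    Returns None when no candidate meets the delay constraint.
    """
    feasible = [d for d in candidates if d[2] <= t0]
    if not feasible:
        return None
    return min(feasible, key=lambda d: d[0] * d[1])


if __name__ == "__main__":
    print(best_under_delay(designs, 23.0))
    print(best_under_delay(designs, 17.0))
```

Relaxing the clock constraint from 17 ns to 23 ns lets the search pick a much
smaller, lower-power cell, which is the kind of decision the tradeoff data is
meant to support.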

2.  Formulation 

The basic approach we take will be to search for local improvements on random
initial designs. The search strategy will be to consider all single or double
changes in element size along the critical path. When only single changes are
tried, we call the procedure "1-change"; when double changes are tried,
"2-change". The idea is that the critical path indicates which parameters are
most important to performance at any given point in the analysis.

We will limit the optimization to choice of pulldown widths. The method can
be extended to choice of layers, orientation, and topologies. We will, however,
study three radically different topologies for the full adder: the PLA,
data selector, and random logic.

The  main  analysis  tools  used  in  these  experiments  are  the  timing  simulator 
CRYSTAL,  and  the  power-estimation  program  POWEST,  together  with  the  rest  of 
the  Berkeley  tool  package  [2]. 

Another  essential  component  of  the  work  is  a  procedural,  constraint-based 
layout  language  for  specifying  VLSI  layouts;  in  this  case,  we  used  the  new 
language  ALLENDE  being  developed  at  Princeton,  a  successor  to  ALI2  and  CLAY 
[3,4,5].  This  allows  us  to  specify  circuit  parameters  and  have  a  cifplot  generated 
automatically. 

3.  The  Critical-Path  Optimization 

Figure 1 shows how the optimization is performed in our experiments. In
Figure 1, faparm is an input parameter vector to ANALYSIS which has diffusion
widths of nodes as described in section 4. The initial faparm is generated at
random by RANDOM according to its input file pattern. ANALYSIS takes faparm as
its input and generates an appropriate layout and its resulting T, P, and A, as
well as the nodes on the critical path (hereafter called the critical path
nodes). Since every node on the critical path has an associated parameter in
faparm, CASEGEN can generate faparms as subcases by using the one-(two-)change
method. Here the one-(two-)change method changes one (two) parameter(s)
associated with the critical path nodes by one step. (From here on the
1-change method is denoted by 1-opt or Random 1-opt, and the 2-change method
by 2-opt or Random 2-opt.)
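The subcase generation can be sketched as follows. This is only an
illustration of the one-(two-)change idea, not the authors' program: the
function name casegen, the step size of 1, the minimum width of 2, and the
decision to try both increasing and decreasing steps are assumptions.

```python
from itertools import combinations, product


def casegen(faparm, critical, n_change=1, step=1, min_width=2):
    """Yield copies of faparm with n_change critical-path entries moved one step.

    faparm   : tuple of diffusion widths
    critical : indices of the parameters associated with critical path nodes
    n_change : 1 for the 1-change method, 2 for the 2-change method
    """
    for idxs in combinations(critical, n_change):
        # Try every combination of "one step down" / "one step up".
        for deltas in product((-step, step), repeat=n_change):
            cand = list(faparm)
            for i, delta in zip(idxs, deltas):
                cand[i] += delta
            # Skip candidates that fall below the assumed minimum width.
            if all(cand[i] >= min_width for i in idxs):
                yield tuple(cand)


if __name__ == "__main__":
    start = (16, 12, 3, 2)
    print(list(casegen(start, critical=[2, 3], n_change=1)))
    print(list(casegen(start, critical=[2, 3], n_change=2)))
```

Restricting the changes to critical-path indices keeps the number of subcases
small even when the parameter vector is long, as it is for the PLA.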

The optimization strategy is shown in the flowchart of figure 2. When the
first improvement occurs, this case is picked up for the next iteration. If no
improvement occurs but there exists a case which has the same cost and has
not yet been analyzed, this case is adopted next. Otherwise a new random
faparm is generated for the next iteration, to search for other locally optimal
points. We used two cost criteria for optimization: T, and Pmax·T (hereafter
denoted by PT). Figure 3 shows an outline of the main procedures used in the
ANALYSIS loop. A short description of each follows below:

1) ALLENDE   This procedural constraint-based VLSI layout language produces
             an integrated circuit layout in Caltech Intermediate Form (CIF)
             corresponding to the specified parameters [5].

2) MEXTRA    MEXTRA reads CIF and extracts the nodes to create a circuit
             description for further analyses [2].

3) CRYSTAL   CRYSTAL is used for finding the worst-case delay time of the
             circuit [2].

4) POWEST    POWEST is used for finding the average and maximum power
             consumption of the circuit.

5) CRITICAL  CRITICAL reports the critical path nodes by using the output
             of CRYSTAL.

6) LIST      This command stores the vector of results (T, P, A) in the
             HISTORY file for further optimization.

In figure 3 the squares surrounded by dotted lines are files used for inputs
or outputs of the above procedures.

1) faparm                     The faparm has parameters for layout generation;
                              for example, the diffusion width of each node,
                              the permutation of product terms in a PLA, etc.

2) layout generating program  There are several ALLENDE programs implementing
                              desired circuit topologies such as the PLA,
                              random logic, etc. Each program requires
                              parameters in its corresponding faparm.

3) the critical path nodes    The critical path nodes are extracted from the
                              output of CRYSTAL. Each node can be associated
                              with parameters in faparm. This is done by
                              looking up a table for each topology, which
                              associates each node with its corresponding
                              parameter.
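The search strategy of this section (take the first improvement, otherwise
move to an equal-cost case that has not yet been analyzed, otherwise restart
from a new random faparm) can be sketched as below. This is a hedged
illustration only: toy_analysis is a made-up stand-in for the real ANALYSIS
chain (ALLENDE, MEXTRA, CRYSTAL, CRITICAL), and the width bounds, restart
count, and all function names are assumptions.

```python
import random


def neighbours(p, critical, lo=2, hi=24):
    """1-change subcases: move one critical-path parameter by one step."""
    for i in critical:
        for step in (-1, 1):
            if lo <= p[i] + step <= hi:
                yield p[:i] + (p[i] + step,) + p[i + 1:]


def local_search(analysis, n_params, restarts=5, lo=2, hi=24, seed=0):
    """First-improvement search with equal-cost moves and random restarts."""
    rng = random.Random(seed)
    history = {}  # plays the role of the HISTORY file

    def evaluate(p):
        if p not in history:
            history[p] = analysis(p)
        return history[p]

    optima = []
    for _ in range(restarts):
        current = tuple(rng.randint(lo, hi) for _ in range(n_params))
        cost, critical = evaluate(current)
        while True:
            moved, plateau = False, None
            for cand in neighbours(current, critical, lo, hi):
                seen = cand in history
                c, crit = evaluate(cand)
                if c < cost:  # first improvement: adopt it immediately
                    current, cost, critical, moved = cand, c, crit, True
                    break
                if c == cost and not seen and plateau is None:
                    plateau = (cand, crit)  # same cost, not yet analyzed
            if not moved and plateau is not None:
                current, critical = plateau
                moved = True
            if not moved:
                break  # locally optimal: fall through to a random restart
        optima.append((cost, current))
    return min(optima), history


def toy_analysis(p):
    """Made-up delay model: cost is the total distance from width 8."""
    delay = sum(abs(x - 8) for x in p)
    critical = [max(range(len(p)), key=lambda i: abs(p[i] - 8))]
    return delay, critical


if __name__ == "__main__":
    best, history = local_search(toy_analysis, n_params=4)
    print(best, len(history))
```

Caching results in a dictionary mirrors the role of the HISTORY file: a case
that was already analyzed is never simulated again, which matters when each
analysis takes on the order of minutes.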

4.  Full-Adder  Circuit  Implementations 

As mentioned in the Introduction, we adopted the 1-bit full-adder circuit as
an example for experimentation, because it is relatively simple, but is a basic
arithmetic logic circuit. The 1-bit full-adder circuit can be implemented in
many ways. We chose three kinds of circuits: the PLA, Data Selector, and Random
Logic. Each layout has several parameters. We will use the vector
representation of these parameters; that is, d = (d1, d2, ..., dn) means that
the diffusion width of node i is di λ. We also use the vector
k = (k1, k2, ..., kn) to mean that the pullup to pulldown ratio of the
inverter, NOR, or NAND circuit in which node i exists is ki. The vector k is
fixed for each circuit.

1) PLA

Figure 4 shows the full-adder circuit diagram implemented by a programmable
logic array (PLA) [7]. This layout has the following 17 parameters and 2
permutations.

^  “  (  ^*ndj . ■  •  •  •  ) 

k  =  (  4, 4,4, 4.4, 4,4, 4, 4.4. 4, 4, 4, 4,4, 4,4  ) 

- 7 pulldown diffusion widths of the AND plane.
- 2 pulldown diffusion widths of the OR plane.
- 6 pulldown diffusion widths for inputs.
- 2 pulldown diffusion widths for outputs.
- 1 permutation of product terms in the AND plane.
- 1 permutation of outputs.

In  the  optimization  process,  the  two  permutations  are  fixed  for  the  sake  of  sim¬ 
plicity.  However  those  two  permutations  are  chosen  in  advance  in  order  to  give 
the  best  result  before  the  optimization  by  doing  experiments  based  on  various 
random  permutations  as  inputs. 

2)  Berkeley  PLA 

The  PLA  generated  by  using  mkpla  of  the  Berkeley  VLSI  tools  [2,8]  is  used  for 
the  purpose  of  cost  comparison  with  the  PLA  implemented  in  1).  This  PLA  is  not 
optimized,  but  uses  the  following  fixed  parameter  vector. 
d  =  (  4, 4,4, 4, 4, 4, 4, 4, 4,8,8, 9, 8,8, 8, 8, 8  ) 
k  =  (  4, 4.4, 4, 4, 4, 4.4,4, 4, 4, 4,4,4, 4, 4, 4  ) 

3)  Data  Selector 

Figure 5 shows the full-adder circuit diagram of a Data Selector implementation
[9]. The following truth table is used (A' denotes the complement of A).

Ci  B  |  S   Co
0   0  |  A   Ci (or B)
0   1  |  A'  A
1   0  |  A'  A
1   1  |  A   Ci (or B)

This circuit selects inputs (A, A', or Ci) instead of calculating S and Co.
Here Ci is the input carry signal, Co is the output carry signal, and S is the
output sum signal. A and B denote the two other inputs. This layout has the
following 8 parameters.

d = ( dA, dB, dCi, d1, d2, d3, dCo, dS )
k = ( 4, 4, 4, 4, 8, 4, 8, 8 )

-  3  pulldown  diffusion  widths  for  input  inverters. 

-  3  pulldown  diffusion  widths  for  internal  inverters. 

-  2  pulldown  diffusion  widths  for  output  inverters. 
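The selection rule of the Data Selector can be checked against ordinary binary
addition with a short sketch. It assumes the rule implied by full-adder
arithmetic: when Ci equals B the sum is A and the carry out is Ci, otherwise
the sum is the complement of A and the carry out is A. The function name is
hypothetical.

```python
from itertools import product


def data_selector_full_adder(a, b, ci):
    """Select S and Co from A, NOT A, or Ci according to the pair (Ci, B)."""
    not_a = 1 - a
    if ci == b:
        s, co = a, ci       # rows (Ci, B) = (0, 0) and (1, 1)
    else:
        s, co = not_a, a    # rows (Ci, B) = (0, 1) and (1, 0)
    return s, co


def check_all_inputs():
    """Return True if the selector agrees with A + B + Ci on all 8 inputs."""
    return all(
        2 * co + s == a + b + ci
        for a, b, ci in product((0, 1), repeat=3)
        for s, co in [data_selector_full_adder(a, b, ci)]
    )


if __name__ == "__main__":
    print(check_all_inputs())
```

No arithmetic is performed on A itself: the circuit only routes A, its
complement, or Ci to the outputs, which is why inverters dominate the
parameter list above.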

4)  Random  Logic 

Figure  6  shows  the  circuit  diagram  of  the  Random  Logic  Implementation  [6]. 
This layout has the following 4 parameters.
d = (  )

k = ( 8, 12, 4, 4 )

-  2  pulldown  diffusion  widths  for  internal  inverters. 

-  2  pulldown  diffusion  widths  for  output  inverters. 

All the circuits above were verified by ESIM [2] or SIMULATE [5].

5.  Parameterization 

The diffusion width of the pullup in each stage is automatically determined
and implemented by ALLENDE in the following way. Suppose that the current
parameter vector is d = (d1, d2, ..., dn), and the pullup-to-pulldown ratio
vector of the specified layout is k = (k1, k2, ..., kn). (The choice of
pullup-to-pulldown ratio is discussed in [7].) For each node i, define the
variables Zpu, Zpd and a pullup-to-pulldown ratio K as follows.

$$Z_{pu} = \frac{L_{pu}}{W_{pu}}, \qquad Z_{pd} = \frac{L_{pd}}{W_{pd}}, \qquad K = \frac{Z_{pu}}{Z_{pd}}$$

where

Lpu (Lpd) is the length of pullup (pulldown).
Wpu (Wpd) is the width of pullup (pulldown).
Wpd = di, K = ki and Lpd = 2.

Lpu and Wpu are determined as follows.

If Wpd ≤ 2K:

$$W_{pu} = 2, \qquad K = \frac{L_{pu}/2}{2/W_{pd}} \quad\text{or}\quad L_{pu} = \frac{4K}{W_{pd}}$$

If Wpd > 2K:

$$W_{pu} = \frac{W_{pd}}{K}, \qquad K = \frac{L_{pu}/W_{pu}}{2/W_{pd}} \quad\text{or}\quad L_{pu} = \frac{2 K W_{pu}}{W_{pd}}$$
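The sizing rule can be written as a small function. This is a sketch under
the assumptions that the pulldown length is fixed at 2, that the minimum
pullup width is 2, and that the ratio K = (Lpu/Wpu)/(Lpd/Wpd) must hold
exactly; the function name and argument names are hypothetical, and all
dimensions are in units of λ.

```python
def pullup_size(w_pd, k, l_pd=2.0, w_min=2.0):
    """Return (W_pu, L_pu) for a pulldown of width w_pd and ratio k.

    For narrow pulldowns (w_pd <= 2k) the pullup keeps the minimum width and
    is lengthened; for wide pulldowns the pullup is widened instead.
    """
    w_pu = w_min if w_pd <= 2 * k else w_pd / k
    # Solve K = (L_pu / W_pu) / (L_pd / W_pd) for L_pu.
    l_pu = k * l_pd * w_pu / w_pd
    return w_pu, l_pu


if __name__ == "__main__":
    for w_pd, k in [(4, 4), (8, 4), (16, 4)]:
        print(w_pd, k, pullup_size(w_pd, k))
```

With k = 4, a pulldown of width 8 still gets the minimum pullup width of 2,
while a pulldown of width 16 needs a pullup of width 4 to keep the same ratio.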


We adopted the following choices.

1) λ = 2 μm.

2) The timing estimation program CRYSTAL uses an input pulse which is 1 nsec
wide.


6.  Results 

Table 1 shows a comparison of the performance of our implementations.
Each row represents one locally optimal point, using as criterion the item
indicated by *. The units of A, Pav, Pmax, T, APT and PT are λ², (10^-5 · W),
(10^-5 · W), ns, (10^-12 · λ² · W · ns) and (10^-3 · W · ns) respectively in
all tables. Figure 7 shows Pmax vs T curves for different topologies, while
figure 8 shows several Pmax vs T trajectories obtained during the process of
optimization using the 1-change and 2-change methods for the Data Selector and
the Random
Table 1. Performance comparison (1-bit full adder)

type            A      Pav   Pmax   T      APT   PT    parameter
PLA             21580  8472  10183  12.8*  2802  1303  1)
                21840  5878   9241  15.3*  3087  1413  2)
                21762  5503   8618  14.9*  2794  1284  3)
PLA(Berkeley)   22178  7314  11749  12.8*  3339  1504  4)
Data Selector    8100  3765   6117  15.8*   783   966  8 8 8 8 8 8 8 8
                 8100  3529   5645  16.5*   754   931  8 8 8 4 8 8 8 8
                 8190  3784   8116  15.9*   796   972  12 8 8 8 8 8 8 8
Random Logic     7742  1331   1957  16.5*   392   323  16 12 3 2
                 9600  1883   2427  16.4*   382   398  16 24 2 3
                 9800  1844   2329  18.4*   378   382  16 24 2 2
                 9600  1723   2506  16.5*   397   413  16 24 3 3
                 5194   705   1096  22.6    128   248* 6 8 2 2
                 4704   826   1018  25.9    124   264* 4 6 3 2
                 5136   744   1174  22.9    138   269* 8 8 2 3

1) d = ( 4, 4, 4, 4, 4, 4, 3, 4, 4, 8, 8, 8, 4, 4, 4, 8, 2 )
2) d = ( 4, 2, 3, 3, 3, 3, 3, 4, 3, 8, 8, 8, 4, 4, 4, 8, 2 )
3) d = ( 3, 3, 3, 4, 4, 4, 4, 3, 3, 8, 8, 8, 4, 4, 4, 4, 3 )
4) d = ( 4, 4, 4, 4, 4, 4, 4, 4, 4, 8, 8, 8, 8, 8, 8, 8, 8 )


Table 2. Performance comparison (4-bit parallel adder)

type            A      Pav    Pmax   T       APT      PT      parameter
Data Selector   41310  16536  28218   75.3*   877761  212482  4 8 8 8 16 8 16 16
                44556  16536  28218   84.1*  1057230  237313  4 8 8 8 16 8 24 16
                45409  18534  28213   84.3*  1079990  287836  4 8 8 16 16 16 16 16
                44523  13248  21641   91.0*   878805  196933  4 8 8 8 16 4 8 4
                42845  12301  19748   92.5*   782645  182669  4 8 4 4 16 8 8 4
                43747  11362  17868   94.9*   741806  169567  4 8 4 4 18 4 8 4
                43605  12354  20692   98.0*   884229  202782  2 8 4 8 16 8 4 8
                45441  11885  19753  100.8*   904777  199110  2 8 4 8 16 8 4 4
                44523  12305  19755  101.1*   889227  199723  4 8 4 8 16 4 4 8
                44649  11831  18808  103.2*   866631  194099  4 8 4 4 16 8 4 4
                43747  11362  17868  103.8*   809812  185112  4 8 4 4 16 4 4 8
Random Logic    35552   6577  10335   41.1*   151014   42476  16 12 8 2
                34848   8734  10649   41.4*   153634   44087  16 12 8 3
Logic circuit. Each point takes about 1.5 minutes of cpu time on a VAX 11/750.
Many of the locally optimal solutions have identical parameter values on the
critical path, but differ in other coordinates because of different random
starting values.

7. Parallel Adder: The effect of loading factors

The preceding results did not take the loading on the output of the circuit
into account. When these circuits are used in arrays, this may become
important. To study this problem, we implemented two circuits for a 4-bit
parallel adder, using the Data Selector and the Random Logic 1-bit full adders
of the previous section. The results are shown in Table 2.
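How four leaf cells combine into the 4-bit parallel adder can be sketched at
the logic level as follows. This is only a functional model, assuming a
ripple-carry connection in which the carry output of each 1-bit cell drives
the carry input of the next; it says nothing about layout or timing, and the
function names are hypothetical.

```python
def full_adder(a, b, ci):
    """Logic-level 1-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ ci
    co = (a & b) | (ci & (a ^ b))
    return s, co


def parallel_adder(x, y, n_bits=4, carry_in=0):
    """Add two n_bits-wide integers by chaining 1-bit full adders."""
    carry, total = carry_in, 0
    for i in range(n_bits):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        total |= s << i
    return total, carry


if __name__ == "__main__":
    print(parallel_adder(9, 7))   # 9 + 7 = 16: sum bits 0, carry out 1
    print(parallel_adder(5, 6))   # 5 + 6 = 11: sum bits 11, carry out 0
```

In such a chain the carry output of each cell is loaded by the next cell,
which is why the carry output stage matters more in the 4-bit adder than in an
isolated 1-bit cell.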

8. Discussion of Results

8.1. Pmax vs T tradeoff

Figure 8 shows Pmax-T trajectories followed by the critical path optimization
process, when minimizing T for the Random Logic circuit. The dotted
envelope shows the final tradeoff curve for Pmax vs T. Notice that the locally
optimal point obtained by using PT as the cost criterion lies very close to the
trajectory obtained when minimizing T. (See point a, with P = 12.5 mW and
T = 22.4 ns.) For comparison, the optimization for PT gave us a locally optimal
point b with P = 10.9 mW and T = 22.6 ns, very close to point a. Thus,
optimization using the two criteria is consistent.

8.2. Performance comparison among the PLA, Data Selector, and Random Logic

Table 3. Normalized performance comparison (1-bit full adder)

type            A    Pav  Pmax  T    APT  PT
Random Logic    100  100  100   100  100  100
Data Selector   105  283  313    96  200  299
PLA             278  486  520    78  715  403
PLA(Berkeley)   286  550  600    78  852  466
Table 3 shows a normalized performance comparison of the best locally
optimal point for each layout, minimizing T. The Random Logic seems to be the
best choice in all respects except T. However, it is the fastest among the
4-bit parallel adder implementations. The T of the 4-bit parallel adder using
Random Logic is less than 4 times the T of the 1-bit full adder, while in the
other layouts it is more than 4 times the T of the 1-bit full adder. The reason
is that this Random Logic 1-bit full adder circuit calculates the carry signal
and propagates it before the calculation of the sum signal, so the carry ripple
propagates faster than the sum. As a result, the 4-bit parallel adder takes
only 2.5 times as much time as the 1-bit full adder. Figure 7 shows the P-T
tradeoff curve of each layout. The curve for the Random Logic circuit is below
the one for the Data Selector, which is below that for the PLA. Hence we can
order the layouts with Random Logic best, Data Selector next, and PLA last.
This result agrees with our intuition because this order is the same as the
order of circuit specialization.
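The normalization can be reproduced from Table 1 with a few lines. The sketch
below assumes that each entry of Table 3 is the corresponding Table 1 value
divided by the Random Logic value and rounded to the nearest percent, and that
the reference rows are the first rows listed for each type; only the Pmax, T
and PT columns are recomputed here.

```python
# (Pmax, T, PT) copied from the first Table 1 row of each type.
table1 = {
    "Random Logic": (1957, 16.5, 323),
    "Data Selector": (6117, 15.8, 966),
    "PLA": (10183, 12.8, 1303),
    "PLA(Berkeley)": (11749, 12.8, 1504),
}


def normalize(rows, reference="Random Logic"):
    """Express every row as a percentage of the reference row."""
    ref = rows[reference]
    return {
        name: tuple(round(100 * v / r) for v, r in zip(row, ref))
        for name, row in rows.items()
    }


if __name__ == "__main__":
    for name, row in normalize(table1).items():
        print(f"{name:15s} Pmax={row[0]:4d} T={row[1]:4d} PT={row[2]:4d}")
    # Delay of the 4-bit Random Logic adder relative to the fastest 1-bit cell.
    print(round(41.1 / 16.4, 1))
```

The recomputed PT entries (299 for the Data Selector and 403 for the PLA)
correspond to the ratios of about 1/3 and 1/4 quoted for the time-power
product.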

8.3.  Comparison  between  our  PLA  and  the  Berkeley  PLA 

Both PLA's have almost the same costs, except for P. The reason is that our
locally optimal point occurs at the choice
d = (4, 4, 4, 4, 4, 4, 3, 4, 4, 8, 8, 8, 4, 4, 4, 3, 2), while the Berkeley PLA
adopts d = (4, 4, 4, 4, 4, 4, 4, 4, 4, 8, 8, 8, 8, 8, 8, 8, 8). The Berkeley
PLA is therefore very close to a local optimum with respect to T.

8.4. Comparison with Myers' work

Myers did similar performance comparisons of various 1-bit full adder
implementations [9], but did not use any optimization. His results, shown in
Table 4 below, are quite different from ours, shown in Table 3. Our results show
that an appropriate choice of layout and its optimization makes the Random
Logic circuit better than the Data Selector, and that the PLA can be made very
fast at the expense of power.

Table 4. 1-bit full adder normalized performance comparison (Myers [9])

type            A    Pmax  T    APT  PT
Random Logic    100  100   100  100  100
Data Selector    45   50   125   28  72.5
PLA             105  110   170  196  187

8.5.  4-bit  Parallel  Adder 

Tables 1 and 2 show that, for the Random Logic circuit, the locally optimal
point of the 1-bit full adder is attained with a pulldown diffusion width of
the carry output stage dco = 2 or 3, while the corresponding width for the
4-bit parallel adder is dco = 8. The pullup width remains 2. This suggests that
the critical path passes through the pulldown of the output carry stage, which
is indeed the case.

On the other hand, for the Data Selector, the critical path passes through the
pullup of the output carry stage, and in fact it is the pullup width that
expands during optimization of the 4-bit parallel adder.

8.6. Comparison of the 1-change and 2-change methods

Figure 8 and Table 5 show a comparison between the 1-change and the 2-change
methods when applied to the Random Logic implementation. Table 5 is
discussed in the next section. The slope of the 2-change method is steeper than
that of the 1-change method, but the 2-change method reaches better locally
optimal points. Hence in this case the 2-change method works better than the
1-change method does. However, the 2-change method does not work as well as
the 1-change method for the Data Selector, which has many more parameters.
The 2-change method took more iterations than the 1-change method and did
not obtain better locally optimal points.

8.7. Effectiveness of our optimization: Cost improvement ratio

Table 5 below shows the average initial delay time T0 (obtained from random
starts), the average locally optimal delay time Topt, the average percent
improvement of the delay time T, and the best locally optimal delay time Tmin.
We can see from this that, for the Random Logic circuit, 2-opt performs much
better than 1-opt. We should note that it is very important to choose a good
order in which to try improvements, because this saves unnecessary search time
evaluating changes that are unlikely to be improvements. For example, we chose
the diffusion widths of the 3-input NAND gate as the first parameters tried for
the Random Logic circuit.
Table 5. Cost improvement of our optimization methods

type            opt    criterion  Tinitial  Topt  % improvement  Tmin
Random Logic    1-opt  T          29.7      19.2  33             19.1
Random Logic    2-opt  T          29.7      16.8  42             16.4
Data Selector   1-opt  T          24.3      17.7  25             15.8
Data Selector   2-opt  T          23.5      18.0  23             15.8
PLA             1-opt  T          19.3      16.3  16             12.8